initial prototype for vits2 #2838

Closed · wants to merge 2 commits

Conversation

@p0p4k (Contributor) commented Aug 4, 2023

For testing, these are the notebooks at my other repo.
I still have some questions about the techniques used in the paper; they do not go into the specifics. It would be great if I could get some help.

@lexkoro thanks for the good discussions.

@p0p4k (Contributor, Author) commented Aug 4, 2023

#2828

@erogol (Member) left a comment

Looks like WIP. Pls, tag me again when you need a review.

@@ -430,3 +430,133 @@ def forward(self, x, x_mask):
x = self.norm_layers_2[i](x + y)
x = x * x_mask
return x

class ConditionalRelativePositionTransformer(nn.Module):
@erogol (Member):

Better if you move this into layers/vits2

return int((kernel_size * dilation - dilation) / 2)


class TextEncoder(nn.Module):
@erogol (Member):

I guess they use a regular transformer, not the relative-position one.
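
For context, a "regular" transformer text encoder built from plain torch.nn primitives (no relative-position attention) could look roughly like the sketch below. This is only an illustration of the distinction being pointed at, not code from this PR; the class name and hyperparameters are assumptions, and any positional encoding would have to be added to the embeddings separately.

import torch.nn as nn

# Illustrative sketch: a plain transformer encoder, in contrast to the
# relative-position attention blocks used by the original VITS text encoder.
class PlainTransformerTextEncoder(nn.Module):
    def __init__(self, n_vocab, hidden_channels=192, n_heads=2, n_layers=6):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, hidden_channels)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_channels, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x, x_pad_mask):
        # x: [b, t] token ids; x_pad_mask: [b, t] bool, True at padded positions
        h = self.emb(x)
        return self.encoder(h, src_key_padding_mask=x_pad_mask)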

@p0p4k (Contributor, Author) commented Aug 5, 2023

Ah, yes. I forgot to turn it into a draft. I'll let you know as soon as it's done. Thanks!

@p0p4k p0p4k marked this pull request as draft August 5, 2023 11:25
@manmay-nakhashi (Collaborator) commented:

    def forward_mas(self, outputs, z_p, m_p, logs_p, x, x_mask, y_mask, g, lang_emb, noise_scale=0.01):
        # find the alignment path
        attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(y_mask, 2)
        with torch.no_grad():
            o_scale = torch.exp(-2 * logs_p)
            logp1 = torch.sum(-0.5 * math.log(2 * math.pi) - logs_p, [1]).unsqueeze(-1)  # [b, t, 1]
            logp2 = torch.einsum("klm, kln -> kmn", [o_scale, -0.5 * (z_p**2)])
            logp3 = torch.einsum("klm, kln -> kmn", [m_p * o_scale, z_p])
            logp4 = torch.sum(-0.5 * (m_p**2) * o_scale, [1]).unsqueeze(-1)  # [b, t, 1]
            logp = logp2 + logp3 + logp1 + logp4
            
            # Adding Gaussian noise to the log probabilities
            epsilon = torch.randn_like(logp) * torch.std(logp) * noise_scale
            logp += epsilon
            
            attn = maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()  # [b, 1, t, t']

        # duration predictor
        attn_durations = attn.sum(3)
        if self.args.use_sdp:
            loss_duration = self.duration_predictor(
                x.detach() if self.args.detach_dp_input else x,
                x_mask,
                attn_durations,
                g=g.detach() if self.args.detach_dp_input and g is not None else g,
                lang_emb=lang_emb.detach() if self.args.detach_dp_input and lang_emb is not None else lang_emb,
            )
            loss_duration = loss_duration / torch.sum(x_mask)
        else:
            attn_log_durations = torch.log(attn_durations + 1e-6) * x_mask
            log_durations = self.duration_predictor(
                x.detach() if self.args.detach_dp_input else x,
                x_mask,
                g=g.detach() if self.args.detach_dp_input and g is not None else g,
                lang_emb=lang_emb.detach() if self.args.detach_dp_input and lang_emb is not None else lang_emb,
            )
            loss_duration = torch.sum((log_durations - attn_log_durations) ** 2, [1, 2]) / torch.sum(x_mask)
        outputs["loss_duration"] = loss_duration
        
        return outputs, attn

class VITS():
    self.noise_decay = 2e-6
    self.noise_scale = 0.01

We can update noise_scale -= noise_decay in on_train_step_start.
@p0p4k
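
A hedged sketch of how that suggested schedule could be wired up, assuming the model exposes the trainer's on_train_step_start callback as the comment implies; the class name Vits2Sketch is hypothetical, and the attribute defaults follow the comment above.

import torch.nn as nn

# Sketch only, not code from this PR: linearly anneal the Gaussian noise
# that forward_mas adds to logp, one step at a time.
class Vits2Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.noise_scale = 0.01  # initial scale of the noise added to logp
        self.noise_decay = 2e-6  # amount subtracted on every training step

    def on_train_step_start(self, trainer):
        # clamp at zero so the noise scale never goes negative late in training
        self.noise_scale = max(self.noise_scale - self.noise_decay, 0.0)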

@p0p4k p0p4k marked this pull request as ready for review September 6, 2023 06:17
@erogol (Member) commented Sep 7, 2023

@p0p4k If you don't mind, we need some testing like:

class TestVits(unittest.TestCase):
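
A minimal sketch of the kind of test being asked for, exercising only the pieces shown in this thread (the noise term in forward_mas and the proposed decay schedule); the test class and method names are hypothetical, not the PR's actual test suite.

import unittest
import torch

class TestVits2MASNoise(unittest.TestCase):
    def test_noise_keeps_logp_shape(self):
        # mirrors the epsilon term in the forward_mas snippet above
        logp = torch.randn(2, 17, 23)
        epsilon = torch.randn_like(logp) * torch.std(logp) * 0.01
        self.assertEqual((logp + epsilon).shape, logp.shape)

    def test_noise_scale_decays_linearly(self):
        # mirrors the proposed noise_scale -= noise_decay schedule
        noise_scale, noise_decay = 0.01, 2e-6
        for _ in range(1000):
            noise_scale = max(noise_scale - noise_decay, 0.0)
        self.assertAlmostEqual(noise_scale, 0.01 - 1000 * 2e-6)

if __name__ == "__main__":
    unittest.main()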

@erogol (Member) commented Sep 27, 2023

@p0p4k all looks good to me. Can you add a recipe for people to get started with training VITS2?
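
A VITS2 recipe would presumably mirror the existing LJSpeech VITS recipe in Coqui TTS. The sketch below is modeled on that recipe; Vits2Config and Vits2 (and their import paths and config fields) are placeholder assumptions for whatever this PR ends up exporting, not the merged API.

import os
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
# Hypothetical imports: the real module paths depend on how this PR lands.
from TTS.tts.configs.vits2_config import Vits2Config  # assumed
from TTS.tts.models.vits2 import Vits2  # assumed

output_path = os.path.dirname(os.path.abspath(__file__))

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path=os.path.join(output_path, "../LJSpeech-1.1/"),
)

config = Vits2Config(
    run_name="vits2_ljspeech",
    batch_size=32,
    eval_batch_size=16,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

# standard Coqui TTS bootstrapping, same as the existing VITS recipe
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits2(config, ap, tokenizer, speaker_manager=None)
trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()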

@p0p4k (Contributor, Author) commented Sep 27, 2023

@erogol OK, I will do some cleanup, remove unused stuff, and add a recipe with tests soon. A little busy with other stuff! Thanks!

@erogol (Member) commented Sep 28, 2023

@p0p4k Sure, take your time...

stale bot commented Oct 28, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Oct 28, 2023
@stale stale bot closed this Nov 5, 2023
Labels
wontfix This will not be worked on but feel free to help.
3 participants